Sentence-Length Informed Method for Active Learning Based Resource-Poor Statistical Machine Translation
Authors
Abstract
This paper presents a simple but effective sentence-length informed method for selecting informative sentences in active learning (AL) based SMT. A length factor is introduced that penalizes short sentences in order to balance "exploration" against "exploitation". The penalty is updated dynamically at each iteration of sentence selection, using the ratio of the current candidate sentence's length to the overall average sentence length of the monolingual corpus. Experimental results on the NIST Chinese–English and WMT French–English language pairs show that the proposed sentence-length penalty method outperforms both a typical selection method and a random selection strategy.
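The abstract gives only the idea behind the length factor, not its formula, so the following Python sketch is one possible reading rather than the authors' method. It assumes the penalty is the ratio of candidate length to the average length of the remaining pool, capped at 1.0, and that this penalty multiplies a base informativeness score. The base score used here (the fraction of words not yet seen) and all function names are hypothetical placeholders.

```python
from typing import List, Sequence, Set


def length_penalty(sentence_len: int, avg_len: float) -> float:
    """Ratio of candidate length to average length.

    Capping at 1.0 is an assumption: it penalizes short sentences
    without rewarding very long ones.
    """
    if avg_len <= 0:
        return 0.0
    return min(1.0, sentence_len / avg_len)


def unseen_word_ratio(tokens: Sequence[str], seen: Set[str]) -> float:
    """Hypothetical base score: fraction of tokens not yet in the training data."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t not in seen) / len(tokens)


def select_batch(pool: Sequence[List[str]], seen: Set[str],
                 batch_size: int) -> List[List[str]]:
    """Greedily pick sentences, recomputing the penalty at every iteration."""
    remaining = list(pool)
    known = set(seen)
    selected: List[List[str]] = []
    for _ in range(min(batch_size, len(remaining))):
        # Average length of the sentences still available (assumed definition).
        avg_len = sum(len(s) for s in remaining) / len(remaining)
        best = max(
            remaining,
            key=lambda s: unseen_word_ratio(s, known)
            * length_penalty(len(s), avg_len),
        )
        selected.append(best)
        remaining.remove(best)
        known.update(best)  # words in the chosen sentence now count as seen
    return selected


if __name__ == "__main__":
    pool = [["x"], ["the", "cat", "sat", "down"], ["the", "the"]]
    print(select_batch(pool, {"the"}, batch_size=2))
```

In this toy pool, the one-word sentence has the highest base score, but the length penalty lets the four-word sentence be chosen first. That is the kind of trade-off between short, novel sentences and longer, more informative ones that the abstract describes.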
Similar resources
Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Data used for training statistical machine translation systems are usually prepared from three kinds of resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are the ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are used instead for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...
A Hybrid Machine Translation System Based on a Monotone Decoder
In this paper, a hybrid Machine Translation (MT) system is proposed by combining the result of a rule-based machine translation (RBMT) system with a statistical approach. The RBMT uses a set of linguistic rules for translation, which leads to better translation results in terms of word ordering and syntactic structure. On the other hand, SMT works better in lexical choice. Therefore, in our sys...
Source Error-Projection for Sample Selection in Phrase-Based SMT for Resource-Poor Languages
The unavailability of parallel training corpora in resource-poor languages is a major bottleneck in cost-effective and rapid deployment of statistical machine translation (SMT) technology. This has spurred significant interest in active learning for SMT to select the most informative samples from a large candidate pool. This is especially challenging when irrelevant outliers dominate the pool. ...
Active Learning for Multilingual Statistical Machine Translation
Statistical machine translation (SMT) models require bilingual corpora for training, and these corpora are often multilingual with parallel text in multiple languages simultaneously. We introduce an active learning task of adding a new language to an existing multilingual set of parallel text and constructing high quality MT systems, from each language in the collection into this new target lan...
Multi-Strategy Approaches to Active Learning for Statistical Machine Translation
This paper investigates active learning to improve statistical machine translation (SMT) for low-resource language pairs, i.e., when there is very little pre-existing parallel text. Since generating additional parallel text to train SMT may be costly, active sampling selects the sentences from a monolingual corpus which if translated would have maximal positive impact in training SMT models. We...